Fine-Tune Mistral v0.3 with ORPO and Unsloth

Low-Rank Adapter Model Fine-tuning

Large Language Models
Author

Matthias De Paolis

Published

July 30, 2024

Image

ORPO is a fine-tuning method that combines traditional supervised fine-tuning and preference alignment into a single training stage. This reduces the computational resources and time required for training. Moreover, the ORPO paper reports that it outperforms other alignment methods across various model sizes and benchmarks. In this article, we will fine-tune the latest Mistral 7B model using ORPO, the TRL library, and Unsloth. The code for this process can be found on Google Colab and in the LLM Tutorial on GitHub.

ORPO

Instruction tuning and preference alignment are crucial methods for customizing Large Language Models (LLMs) for particular tasks. Typically, this entails a multi-step process: first, Supervised Fine-Tuning (SFT) on instructions to tailor the model to the desired domain, and second, preference alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the probability of producing preferred responses over less desirable ones.

Researchers have discovered a drawback in this method. Although SFT successfully adjusts the model to the target domain, it also unintentionally raises the chances of producing both unwanted and desired answers. Therefore, the preference alignment stage is essential to enlarge the disparity between the probabilities of accepted and rejected outputs.

Image

Hong and Lee (2024) presented ORPO, an innovative approach that combines instruction tuning and preference alignment into a single training framework. ORPO modifies the conventional language modeling objective by incorporating the negative log-likelihood loss with an odds ratio (OR) component. This OR loss applies a mild penalty to rejected responses while greatly rewarding preferred ones, allowing the model to simultaneously learn the target task and align with human preferences.

\mathscr{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}[\mathscr{L}_{SFT} + \lambda \cdot \mathscr{L}_{OR}]
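To make the odds-ratio term concrete, here is a minimal numeric sketch. It assumes the definitions given by Hong and Lee (2024): odds(y|x) = P(y|x) / (1 - P(y|x)), and an OR loss equal to the negative log-sigmoid of the log odds ratio between the chosen response y_w and the rejected response y_l. The function names and the example probabilities are illustrative assumptions, and the default weight of 0.2 is chosen only to mirror the beta value used later in this article.

```python
import math

def log_odds(p):
    """Log of the odds p / (1 - p) for a probability 0 < p < 1."""
    return math.log(p) - math.log1p(-p)

def odds_ratio_loss(p_chosen, p_rejected):
    """L_OR = -log(sigmoid(log_odds(chosen) - log_odds(rejected)))."""
    log_or = log_odds(p_chosen) - log_odds(p_rejected)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z))
    return math.log1p(math.exp(-log_or))

def orpo_loss(sft_loss, p_chosen, p_rejected, lam=0.2):
    """Combined objective: L_SFT + lambda * L_OR (lam is an assumed value)."""
    return sft_loss + lam * odds_ratio_loss(p_chosen, p_rejected)

# Hypothetical sequence probabilities for one preference pair.
well_separated = odds_ratio_loss(p_chosen=0.6, p_rejected=0.2)
poorly_separated = odds_ratio_loss(p_chosen=0.3, p_rejected=0.3)
print(f"L_OR when the chosen answer is favoured: {well_separated:.4f}")
print(f"L_OR when both answers are equally likely: {poorly_separated:.4f}")
```

The OR term shrinks as the model assigns higher odds to the chosen answer than to the rejected one, which is the gap that the preference signal is meant to widen.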

ORPO has been integrated into key fine-tuning libraries such as TRL, Axolotl, and LLaMA-Factory. The following section will demonstrate its usage with TRL.

Fine-Tuning Mistral v0.3 with ORPO and Unsloth

Mistral AI’s v0.3 is a significant update to their AI model, introducing improved performance and efficiency. This version includes enhanced instruction-following capabilities, making interactions more intuitive. Additionally, Mistral v0.3 incorporates advanced reasoning skills, enabling it to tackle complex tasks more effectively. The model supports a context length of 32768 tokens, allowing for more detailed and coherent conversations. Technical details include an extended vocabulary (32000 to 32768 tokens), a new tokenizer, and support for function calling.

ORPO necessitates a preference dataset that includes a prompt, a selected answer, and a discarded answer. To achieve this, we will utilize llmat/dpo-orpo-mix-38k-balanced, a dataset that merges high-quality DPO datasets and has been further balanced using a clustering-based approach.

To fine-tune our model efficiently, we will use the Unsloth library. Unsloth significantly improves speed and memory efficiency when training Large Language Models (LLMs). These gains come from several optimizations, including manual autograd and chained matrix multiplication. Furthermore, it uses Flash Attention via xformers and Tri Dao’s implementation, a highly optimized approach to computing attention in transformer models. According to the Unsloth project, this makes fine-tuning about 2 times faster with 50% less memory usage.

Let’s start by installing the required libraries:

!pip install python-dotenv
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Now let’s log in to our W&B workspace:

import wandb
import os
import dotenv

dotenv.load_dotenv()
%env WANDB_NOTEBOOK_NAME=Fine_tune_Mistral_with_ORPO
wandb.login(key=os.environ["WANDB_API_KEY"])

Load the Model and Tokenizer for LoRA

In the following, we will load the Mistral 7B v0.3 model in 4-bit precision using bitsandbytes.

cache_dir = './model'
model_id = 'mistralai/Mistral-7B-v0.3'
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
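As a rough back-of-the-envelope illustration of why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone. The parameter count of about 7.25 billion is an approximation assumed here, and the estimate ignores quantization metadata, activations, optimizer state, and the LoRA adapters.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

approx_params = 7.25e9  # assumed, approximate parameter count for a 7B-class model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(approx_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of 16-bit weights, which is what leaves room for training on a single consumer GPU.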

Loading Checks

After loading the model, it’s crucial to ensure that all parameters are correctly placed on the GPU and that none are overflowing onto the CPU. This can be particularly important for large models where memory management is critical.

To verify this, you can iterate through the model’s named parameters and check their device type. A parameter whose device type is ‘meta’ holds no real data on the GPU, which indicates that it has been offloaded rather than loaded. Any such parameter will be printed out.

Here is the code to perform this check:

# Check that no parameters were offloaded (left on the 'meta' device) instead of being loaded onto the GPU.
for n, p in model.named_parameters():
    if p.device.type=='meta':
        print(f"{n} is on meta!")

Prepare for LoRA fine-tuning

Before starting the LoRA (Low-Rank Adaptation) fine-tuning process, it’s essential to understand which parameters in your model are trainable and which are not. This helps in ensuring that only the desired parameters are updated during training, which is crucial for efficient and effective fine-tuning.

To achieve this, you can use the following function to print the number of trainable parameters in the model and list which parameters are trainable and which are not.

Here is the code to perform this check:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model and lists which parameters
    are trainable and which are not.
    """
    trainable_params = 0
    non_trainable_params = 0
    all_params = 0

    print("Trainable Parameters:")
    for name, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
            print(f" {name}")
        else:
            non_trainable_params += param.numel()

    print("\nNon-Trainable Parameters:")
    for name, param in model.named_parameters():
        if not param.requires_grad:
            print(f" {name}")

    print(
        f"\nSummary:\n Trainable params: {trainable_params}\n Non-Trainable params: {non_trainable_params}\n All Parameters: {all_params}")
        

Let’s take a look at the model:

print(model)

Setting Up LoRA Fine-Tuning

To prepare your model for LoRA (Low-Rank Adaptation) fine-tuning, you need to configure it properly. This involves setting up the LoRA configuration. Here’s a brief overview of the parameters and their best settings:

  1. r: This parameter controls the rank of the low-rank adaptation matrices. It’s suggested to choose a value greater than 0, with common choices being 8, 16, 32, 64, or 128. The best setting depends on the specific use case and computational resources, but a good starting point is 8 or 16.

  2. lora_alpha: This parameter scales the magnitude of the LoRA update. A higher value can lead to more significant changes in the model’s behavior. The code below uses 32, a common choice.

  3. target_modules: This list specifies which modules in the model should be fine-tuned. The best settings include key modules like "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", and "down_proj". If the task involves chat fine-tuning, it’s also beneficial to set "lm_head" (language model head) as trainable.

  4. use_gradient_checkpointing: This parameter activates gradient checkpointing, which trades extra computation for lower memory use. Setting it to "unsloth" selects Unsloth’s own implementation, which, according to Unsloth, saves additional VRAM by offloading activations to system memory.

  5. random_state: This parameter sets the seed for random number generation, ensuring reproducibility. Any integer works; in the code, it’s set to 3407.

  6. use_rslora: This parameter activates RSLoRA, which adjusts the scaling factor of LoRA adapters to be proportional to 1/√r instead of 1/r. This adjustment enhances the stability of learning, particularly for higher adapter ranks, and improves fine-tuning performance as the rank increases.

These settings provide a good starting point for fine-tuning a language model using PEFT. However, the optimal settings may vary depending on the specific task and dataset, so some experimentation may be necessary.
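The effect of r, lora_alpha, and use_rslora described above can be sketched numerically. The helper names and the 4096x4096 layer size are assumptions made for illustration; the scaling formulas follow the description in the list (alpha/r for standard LoRA, alpha/√r for RSLoRA).

```python
import math

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Scaling applied to the LoRA update: alpha/r, or alpha/sqrt(r) with RSLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

def lora_param_count(d_in, d_out, r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

for r in (8, 16, 64):
    print(
        f"r={r:>3}: standard scale={lora_scaling(32, r):.2f}, "
        f"RSLoRA scale={lora_scaling(32, r, use_rslora=True):.2f}, "
        f"params for a 4096x4096 layer={lora_param_count(4096, 4096, r):,}"
    )
```

With standard scaling the update shrinks quickly as the rank grows, while the RSLoRA factor decays more slowly, which is the stability argument made for higher ranks.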

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    lora_alpha = 32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head", # Language model head - best to set this trainable if chat fine-tuning
    ],
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
)

Set up Tokenizer and Padding

Before starting the fine-tuning process, it’s essential to configure the tokenizer and set up padding correctly. This ensures that the model can handle input sequences efficiently and that special tokens are properly managed.

Here is a step-by-step guide to setting up the tokenizer and padding:

  1. Inspect the Tokenizer: Print out the tokenizer details, including the vocabulary size, beginning-of-sequence (BOS) token, end-of-sequence (EOS) token, and chat template.

  2. Optionally Set the Chat Template Manually: If needed, you can manually set the chat template. This is useful for ensuring that the conversation starts correctly depending on the initial message role.

  3. Apply the Chat Template: Use the chat template to format a list of messages.

  4. Set the Pad Token: Determine the appropriate pad token based on the tokenizer’s vocabulary and set it accordingly.

  5. Update the Model Configuration: Ensure that the model and its configuration are updated with the correct pad token ID.

Here is the code to perform these steps:

print(tokenizer)
print(tokenizer.vocab_size)
print(tokenizer.bos_token)
print(tokenizer.eos_token)
print(tokenizer.chat_template)

Next, we define a custom chat template for Llama/Mistral-style models. This template ensures that conversations start correctly by adding a beginning-of-sequence token (bos_token) only if the first message is not from the assistant. This is particularly useful when formatting chosen and rejected responses separately, as it avoids adding an extra bos_token before the response.

The template is defined using a Jinja-like syntax, which iterates through the messages and formats them based on their roles (user or assistant). For user messages, it wraps the content with [INST] and [/INST] tags, while for assistant messages, it appends an end-of-sequence token (eos_token).

tokenizer.chat_template = """{% if messages[0]['role'] != 'assistant' %}{{ bos_token }}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% endif %}{% endfor %}
"""
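To make the template above easier to follow, here is a plain-Python sketch that mirrors its logic. The function name render_chat is hypothetical, and the default tokens `<s>` and `</s>` are assumed values for the BOS and EOS tokens; the real rendering is done by the tokenizer.

```python
def render_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python mirror of the template logic (illustrative only)."""
    # BOS is added only when the conversation does not start with the assistant.
    text = bos_token if messages[0]["role"] != "assistant" else ""
    for message in messages:
        if message["role"] == "user":
            text += "[INST] " + message["content"] + " [/INST]"
        elif message["role"] == "assistant":
            text += message["content"] + eos_token
    return text

conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_chat(conversation))       # <s>[INST] Hi [/INST]Hello!</s>
print(render_chat(conversation[-1:]))  # Hello!</s>
```

The second call shows why the conditional matters: a response formatted on its own starts without a BOS token, so it can be appended to an already formatted prompt.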

# Test chat template
messages = [
    {'role': 'user', 'content': 'write a quick sort algorithm in python.'},
    {'role': 'assistant', 'content': 'here you are.'},
    {'role': 'user', 'content': 'great.'},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=False)
print(inputs)

# Set the pad token: prefer <pad>, then <|pad|>, then <unk>; fall back to EOS otherwise
if '<pad>' in tokenizer.get_vocab():
    print('<pad> token is in the tokenizer. Using <pad> for pad')
    # Set the pad token
    tokenizer.pad_token = '<pad>'
elif '<|pad|>' in tokenizer.get_vocab():
    print('<|pad|> token is in the tokenizer. Using <|pad|> for pad')
    # Set the pad token
    tokenizer.pad_token = '<|pad|>'
elif '<unk>' in tokenizer.get_vocab():
    print('<unk> token is in the tokenizer. Using <unk> for pad')
    # Set the pad token
    tokenizer.pad_token = '<unk>'
else:
    print(f'Using EOS token, {tokenizer.eos_token}, for padding. Warning, this may not be ideal for chat fine-tuning.')
    tokenizer.pad_token = tokenizer.eos_token

# Update pad token id in model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Check if they are equal
assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID"

# Print the pad token ids
print('Tokenizer pad token ID:', tokenizer.pad_token_id)
print('Model pad token ID:', model.pad_token_id)
print('Model config pad token ID:', model.config.pad_token_id)
print('Number of tokens now in tokenizer:', tokenizer.vocab_size)
print('Special tokens map:', tokenizer.special_tokens_map)
print('All special tokens:', tokenizer.all_special_tokens)
print(tokenizer)
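The fallback order used for the pad token can also be written as a small helper, shown here as a sketch. The function name choose_pad_token and the toy vocabulary are hypothetical; with a real tokenizer, the vocabulary would come from tokenizer.get_vocab().

```python
def choose_pad_token(vocab, eos_token, candidates=("<pad>", "<|pad|>", "<unk>")):
    """Return the first candidate found in the vocabulary, else the EOS token."""
    for token in candidates:
        if token in vocab:
            return token
    return eos_token

toy_vocab = {"<s>": 1, "</s>": 2, "<unk>": 0}  # hypothetical vocabulary
print(choose_pad_token(toy_vocab, eos_token="</s>"))
```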

Loading and Preparing the Dataset for Fine-Tuning

In this section, we will walk through loading and preparing a dataset for fine-tuning. This involves loading the dataset, shuffling it, splitting it into training and test sets, and applying a specific template to format the data correctly.

Here is a step-by-step guide to loading and preparing the dataset:

  1. Import Necessary Libraries: Import the required libraries, including json for handling JSON data and datasets for loading and manipulating the dataset.

  2. Define Dataset Parameters: Set the dataset name and the maximum number of samples to use. If you want to use the full dataset, set max_num_samples to None.

  3. Define the build_dataset Function: Create a function called build_dataset that takes a tokenizer, dataset name, cache directory, maximum number of samples, and other parameters as inputs. This function will load the dataset, shuffle it, split it into training and test sets, and apply a specific template to format the data.

  4. Load the Dataset: Use the load_dataset function from the datasets library to load the dataset. The dataset is split based on the max_num_samples parameter.

  5. Shuffle the Dataset: If max_num_samples is not None, shuffle the dataset to ensure randomness.

  6. Split the Dataset: Determine the number of test samples and split the dataset into training and test sets using the train_test_split method.

  7. Apply the DPO Template: Define a function called apply_dpo_template that formats the data according to the DPO (Direct Preference Optimization) template. This function extracts the necessary information from the dataset and applies the chat template using the tokenizer.

  8. Map the Dataset: Use the map method to apply the apply_dpo_template function to the dataset. Remove the original columns and rename the new columns accordingly.

  9. Return the Dataset: Return the training and test datasets.

  10. Check the Chat Template: Ensure that the chat template is correctly applied and that special tokens are not included when tokenizing the responses.

Here is the code to perform these steps:

# Prepared with the help of code from: https://github.com/xfactlab/orpo/blob/main...
import json

# Load the dataset
dataset_name = 'llmat/dpo-orpo-mix-38k-balanced'

max_num_samples = None # Set to None to use the full dataset
#max_num_samples = 10000 # Uncomment to load only the first 10k samples

from datasets import load_dataset

def build_dataset(tokenizer, data_name, cache_dir=None, max_num_samples=10000, test_size_ratio=0.1):
    # Determine the split specification based on max_num_samples
    split_spec = 'train' if max_num_samples is None else f'train[:{max_num_samples}]'

    # Load the dataset
    full_data = load_dataset(data_name, split=split_spec, cache_dir=cache_dir)

    # Shuffle the dataset when only a subset is loaded
    if max_num_samples is not None:
        full_data = full_data.shuffle(seed=42)

    # Determine the number of test samples
    num_total_samples = len(full_data)
    test_size = int(test_size_ratio * num_total_samples)

    # Randomly split the data into training and test sets
    dataset = full_data.train_test_split(test_size=test_size)

    column_names = list(dataset['train'].features)

    def apply_dpo_template(example):
        # function adapted from https://kaitchup.substack.com/p/fine-tune-a-better-go
        if all(k in example.keys() for k in ('chosen', 'rejected')):
            # For DPO, the inputs are triples of (prompt, chosen, rejected), where 'chosen'
            # and 'rejected' are the final turn of a dialogue.
            # We therefore need to extract the N-1 turns to form the prompt
            prompt_messages = example['chosen'][:-1]
            example['messages'] = example['chosen']

            # Now we extract the final turn to define chosen/rejected responses
            chosen_messages = example['chosen'][-1:]
            rejected_messages = example['rejected'][-1:]
            example['text_chosen'] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example['text_rejected'] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example['text_prompt'] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
        return example

    dataset = dataset.map(apply_dpo_template, remove_columns=column_names,
                desc='Formatting comparisons with prompt template',)

    for split in ['train', 'test']:
        dataset[split] = dataset[split].rename_columns(
            {'text_prompt': 'prompt', 'text_chosen': 'chosen', 'text_rejected': 'rejected', 'messages': 'messages'}
        )

    return dataset['train'], dataset['test']

# Assuming 'tokenizer' and 'dataset_name' are already defined
train, test = build_dataset(tokenizer, dataset_name, cache_dir='./dataset', max_num_samples=max_num_samples)

# Check the chat template: <s> should not be included when tokenizing the responses

After preparing and formatting your dataset for fine-tuning, it’s crucial to inspect the data to ensure that it has been correctly processed. This step helps you verify that the prompt, chosen, rejected, and messages fields are properly formatted and contain the expected information.

print('Prompt:', train['prompt'][0])
print('\n\nChosen:', train['chosen'][0])
print('\n\nRejected:', train['rejected'][0])
print('\n\nMessages (incl. prompt):', train['messages'][0])
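The split performed inside apply_dpo_template can be illustrated on a toy record. This is a sketch that assumes, as the code above does, that 'chosen' and 'rejected' are lists of chat messages sharing the same earlier turns; the helper name and the example content are hypothetical.

```python
def split_preference_example(example):
    """Split a record into prompt turns and the final chosen/rejected turns."""
    return {
        "prompt": example["chosen"][:-1],      # all turns except the last one
        "chosen": example["chosen"][-1:],      # final turn of the preferred dialogue
        "rejected": example["rejected"][-1:],  # final turn of the rejected dialogue
    }

toy_example = {  # hypothetical record in the same shape as the dataset
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "It is 5."},
    ],
}

parts = split_preference_example(toy_example)
for name, turns in parts.items():
    print(name, "->", turns)
```

Each of the three parts is then passed through the chat template separately, which is why the template must not prepend a BOS token to the response-only parts.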

Setting Up and Running Training

In this section, we will set up and run the training for the model. This includes configuring training parameters, creating a custom logging callback, and initiating the training process.

Here is a step-by-step guide to setting up and running the training:

  1. Set Training Parameters: Define the training parameters such as the model name, number of epochs, gradient accumulation steps, batch size, and the directory to save the results.

  2. Create a Custom Logging Callback: Implement a custom callback to log training metrics to a file. This callback will write the training and evaluation loss to a log file and save the trainable parameters at checkpoint steps.

  3. Initialize the Logging Callback: Create an instance of the custom logging callback with the specified log file path.

Here is the code to perform these steps:

model_name = model_id.split('/')[-1]

epochs=1
grad_accum=4
batch_size=8
fine_tune_tag='ORPO'
save_dir = f'./results/{model_name}_{dataset_name}_{epochs}_epochs_{fine_tune_tag}'
print(save_dir)

import transformers
import os
import torch

# Custom callback to log metrics
class LoggingCallback(transformers.TrainerCallback):
    def __init__(self, log_file_path):
        self.log_file_path = log_file_path

    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        logs = logs or {}
        with open(self.log_file_path, 'a') as f:
            # Training and evaluation losses arrive in separate log entries,
            # so check for each key independently.
            if 'loss' in logs:
                f.write(f'Step: {state.global_step}, Training Loss: {logs["loss"]}\n')
            if 'eval_loss' in logs:
                f.write(f'Step: {state.global_step}, Eval Loss: {logs["eval_loss"]}\n')
            f.flush()  # Force flush the buffered data to file

        # Check if the current step is a checkpoint step
        if state.global_step % int(args.save_steps) == 0:
            # Prefer the best checkpoint directory if the trainer has recorded one
            if state.best_model_checkpoint:
                checkpoint_dir = state.best_model_checkpoint
            else:
                # If not, construct the checkpoint directory path
                checkpoint_dir = os.path.join(args.output_dir, f'checkpoint-{state.global_step}')

            # Ensure the checkpoint directory exists
            os.makedirs(checkpoint_dir, exist_ok=True)

            # Save trainable params in the checkpoint directory
            current_trainable_params = {n: p for n, p in model.named_parameters() if p.requires_grad}
            current_trainable_params_state_dict = {n: p.data for n, p in current_trainable_params.items()}
            file_path = os.path.join(checkpoint_dir, 'trainable_params.pt')
            torch.save(current_trainable_params_state_dict, file_path)

# Log file path (the log is written next to the dataset cache)
cache_dir = './dataset'
os.makedirs(cache_dir, exist_ok=True)  # Make sure the directory exists before logging
log_file_path = os.path.join(cache_dir, 'training_logs.txt')

# Create an instance of the custom callback
logging_callback = LoggingCallback(log_file_path)
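Since the callback writes plain-text lines of the form "Step: N, Training Loss: X" and "Step: N, Eval Loss: X", the log can be read back with a few lines of Python. The following is a sketch: the helper name and the sample lines are hypothetical and do not come from a real training run.

```python
import re

LOG_LINE = re.compile(r"Step: (\d+), (Training|Eval) Loss: ([-+0-9.eE]+)")

def parse_training_log(lines):
    """Collect (step, loss) pairs for training and evaluation from log lines."""
    losses = {"Training": [], "Eval": []}
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            step, kind, value = match.groups()
            losses[kind].append((int(step), float(value)))
    return losses

sample_lines = [  # hypothetical log content, not real training output
    "Step: 1, Training Loss: 2.5\n",
    "Step: 2, Training Loss: 2.25\n",
    "Step: 2, Eval Loss: 2.4\n",
]
parsed = parse_training_log(sample_lines)
print(parsed["Training"])
print(parsed["Eval"])
```

With a real run, the lines would come from opening the file at log_file_path instead of the sample list.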

Setting Up ORPO Training

In this section, we’ll walk through setting up and training a model using the ORPOTrainer from the trl library.

I trained the model on the entire dataset (38k samples) using an RTX 4090 GPU (24 GB of VRAM). The training took 7 hours and 35 minutes. You can use smaller GPUs with less VRAM and a smaller batch size. In this case, I recommend only loading a subset of the dataset to speed up training. You can do it by modifying the previous code block, like ‘max_num_samples = 10000’ to only load 10k samples.

Configure ORPO

Define the configuration for the ORPO training. This configuration includes various hyperparameters and settings for training.

from trl import ORPOTrainer, ORPOConfig
from unsloth import is_bfloat16_supported

orpo_config = ORPOConfig(
    beta=0.2,  # weight of the odds-ratio term (lambda in the ORPO loss)
    save_steps=500,
    logging_steps=1,
    num_train_epochs=epochs,
    output_dir=save_dir,
    evaluation_strategy='steps',
    do_eval=True,
    eval_steps=0.2,  # a float below 1 is read as a fraction of the total training steps
    per_device_eval_batch_size=batch_size,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=grad_accum,
    log_level='debug',
    optim='paged_adamw_8bit',
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    max_grad_norm=0.3,
    lr_scheduler_type='linear',
    warmup_ratio=0.03,
    learning_rate=1e-4,
    max_prompt_length=512,
    max_length=1024,
    max_completion_length=1024,
    remove_unused_columns=True,
)
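To get a feel for what these settings imply, the sketch below estimates the effective batch size, the number of optimizer steps, and the evaluation and warmup intervals. It is a rough single-GPU approximation: the training-set size of 34,000 is a hypothetical figure, the helper name is an assumption, and the exact step count depends on the transformers version.

```python
import math

def training_schedule(num_train_samples, batch_size, grad_accum, epochs,
                      eval_steps_ratio, warmup_ratio):
    """Rough single-GPU estimate of optimizer steps and derived intervals."""
    effective_batch = batch_size * grad_accum
    batches_per_epoch = math.ceil(num_train_samples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        # A fractional eval_steps is interpreted as a ratio of the total steps.
        "eval_every": math.ceil(total_steps * eval_steps_ratio),
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

# 34_000 is a hypothetical training-set size used only for illustration.
schedule = training_schedule(34_000, batch_size=8, grad_accum=4, epochs=1,
                             eval_steps_ratio=0.2, warmup_ratio=0.03)
print(schedule)
```

Halving batch_size while doubling grad_accum keeps the effective batch size at 32, which is one way to fit the same configuration onto a GPU with less VRAM.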

Initialize ORPOTrainer

Create an instance of ORPOTrainer with the model, datasets, tokenizer, and the configuration defined earlier.

orpo_trainer = ORPOTrainer(
    model,
    args=orpo_config,
    train_dataset=train,
    eval_dataset=test,
    tokenizer=tokenizer,
    callbacks=[logging_callback], # Add custom callback here
)

Train the Model

Set the model configuration to avoid cache warnings and start the training process.

model.config.use_cache = False # silence the warnings
orpo_trainer.train()

Plotting Training and Evaluation Losses with Matplotlib

After training your model, it’s important to visualize the training and evaluation losses to understand how well your model is performing and to identify any potential issues. Visualizing the losses can help you diagnose problems such as overfitting or underfitting and make informed decisions about further training or model adjustments.

import matplotlib.pyplot as plt

# Initialize lists to hold training and evaluation losses and steps
train_losses = []
eval_losses = []
train_steps = []
eval_steps = []

# Populate the lists from the log history
for entry in orpo_trainer.state.log_history:
    if 'loss' in entry:
        train_losses.append(entry['loss'])
        train_steps.append(entry['step'])
    if 'eval_loss' in entry:
        eval_losses.append(entry['eval_loss'])
        eval_steps.append(entry['step'])

# Plot the losses
plt.plot(train_steps, train_losses, label='Train Loss')
plt.plot(eval_steps, eval_losses, label='Eval Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()

Image

Let’s now check the W&B plots. As the loss goes down, we can also see that the difference between the chosen and rejected answers becomes clearer.

Image

Merging Adapters and Saving the Model to Hugging Face Hub

As a last step, we merge the adapters with the original model using 16-bit precision to enhance quality. Initially, we save it locally in the “model” directory before uploading it to the Hugging Face Hub. The trained model is available at llmat/Mistral-v0.3-7B-ORPO.

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("llmat/Mistral-v0.3-7B-ORPO", tokenizer, save_method="merged_16bit")

Conclusion

This article presented a thorough overview of ORPO fine-tuning and its practical application to a Mistral v0.3 7B model. Utilizing QLoRA’s efficient memory management, we successfully fine-tuned a 7B LLM on a high-quality dataset with minimal GPU resources.

I hope you found this guide helpful. If you liked this article, follow me on Hugging Face @llmat. Best of luck with your model fine-tuning!